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genomic parsimony 
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Abstract: High-dimensional data models, often with low sample size, abound 
in many interdisciplinary studies, genomics and large biological systems being 
most noteworthy. The conventional assumption of multinormality or linear- 
ity of regression may not be plausible for such models which are likely to be 
statistically complex due to a large number of parameters as well as various un- 
derlying restraints. As such, parametric approaches may not be very effective. 
Anything beyond parametrics, albeit, having increased scope and robustness 
perspectives, may generally be baffled by the low sample size and hence un- 
able to give reasonable margins of errors. Kendall's tau statistic is exploited 
in this context with emphasis on dimensional rather than sample size asymp- 
totics. The Chen-Stein theorem has been thoroughly appraised in this study. 
Applications of these findings in some microarray data models are illustrated. 
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1. Introduction 

The past three decades have witnessed a phenomenal growth of research litera- 
ture on statistical methods for large dimensional data models. Such models abound 
in various interdisciplinary fields, especially in the evolving field of genomics and 
bioinformatics. Knowledge discovery and data mining (KDDM) or statistical learn- 
ing tools are usually advocated for such high dimensional data models, often on 
primarily computational or heuristic justifications. The curse of dimensionality is 
so overwhelming that classical likelihood (principle) based statistical inference tools, 
baffled with an excessive number of parameters, may not be robust or efficient. Con- 
ventional assumptions of multinormality of errors and linearity of regression models 
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may not be generally tenable in such contexts. Moreover, having a large number 
of coordinate variables, the assumption of their stochastic independence may not 
be realistic in a majority of cases. On top of that, at least a part of the response 
variables may be discrete or even purely qualitative in nature; often, the categorical 
responses may not reveal any (partial) ordering. In that sense, discrete multivariate 
analysis may appear to be more appropriate than conventional multinomial model 
based analysis. Even for multinormal models, the high-dimensionality may demand 
a far larger sample size in order to implement a full likelihood based asymptotic 
analysis. That is, wc need the conventional n 3> K environment for drawing appro- 
priate statistical conclusions with reasonable precision. 

Typically, in such high-dimensional models, one encounters a K ^ n environ- 
ment, where K is the dimension of the data and n is the sample size. In such high- 
diensional low sample size, HDLSS, models, effective dimension reduction may be 
a challenging statistical task, usually beyond the scope of KDDM. For example, in 
neuronal spike train models, there are literally tens of thousands of neurons (nerve 
cells), and in the presence of external stimuli, the spike trains for any observable 
subset of neurons exhibit a high-degree of nonstationarity. Further, recording of 
such spike trains in a large number of nerve cells may be invasive to the brain 
functioning due to the destructive nature of recording ([Id], Ch. 3). Essentially, we 
have a very high dimensional counting process. Doubly stochastic Poisson processes 
have been considered in the literature, albeit without much claim of optimal reso- 
lutions. In magnetic resonance imaging, MRI, there could be tens of thousands of 
microscopic units producing an enormously high dimensional spatial data model. 
More complexities may arise in case of (functional) fMRI models. For such HDLSS 
models, parametric asymptotics may not have adequate scope or good statistical 
interpretation. 

The transition from conventional normal theory to nonparametric linear models 
has been well fortified along with the development of nonparametric or robust 
statistical methods based on R- statistics (ranks), Af-statistics (maximization) and 
linear combinations of order statistics or L-statistics; see, for example, [8] where 
other pertinent references have been extensively cited. In a more general setup, 
nonparametric regression functionals have been formulated wherein the linearity of 
regression or a specific nonlinear form are not assumed to hold. 

In the context of testing monotonicity of nonparametric regression, without as- 
suming a linear or any specific nonlinear form, Ghosal et al. [-5] considered suitable 
CZ-processes based on a locally smoothed Kendall's tau statistic. They provided gen- 
eral asymptotics for such locally smoothed Kendall's tau processes when both the 
independent and dependent variates arc stochastic, and illustrated their effective 
use in the postulated hypothesis testing problem. Such local versions of Kendall's 
tau statistics have simple statistical interpretation, albeit, in view of possibly slower 
rate of convergence, the impact of large sample size is apparent in their analysis. 
In the contemplated bioinformatics area, as we shall see, the HDLSS scenario calls 
for alternative approaches, and some of these will be explored in this study. 

In a simple regression setup, the Theil-Sen (point as well as interval) estimates of 
the regression slope based on the Kendall tau statistic [15], have simple forms, and 
are computationally tractable and statistically robust. Another advantage of the 
Kendall tau statistic is its adaptability for count data as well as latent-effect models. 
Further, a test for the null hypothesis of no regression based on the Kendall tau 
statistic (being distribution-free under the null hypothesis of invariance) remains 
valid and efficient for such complex models. Our contemplated models, unlike [.">], 
entail a high dimensional data with relatively (and often inadequately) smaller 
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sample size, i.e., the HDLSS {K ^ n) environment. As we shall see in the next 
section, there may not be a genuine temporal pattern. In addition, there may be 
other complications arising from lack of spatial-compactness, spatial homogeneity 
and other spatial dependence patterns. 

For better motivation, in Section 2, an illustration is made with a microarray data 
model where HDLSS models typically arise. Section 3 deals with the appropriateness 
of statistical modeling and analysis based on a pseudo-marginal approach incorpo- 
rating coordinatewise construction of the Kendall tau statistic, in such K ^ n 
environments. Section 4 is devoted to the dimensional asymptotics for the Kendall 
tau process in such HDLSS models where there arc two basic problems : (i) group 
divergence, and (ii) classification of genes into disease and nondisease types. For 
the first problem, a pseudo-marginal approach based on the Hamming distance 
has been explored in [ ! ^] while in the latter context, multiple hypotheses testing 
(MHT) problems in HDLSS setups arise in a different perspective and call for some 
alternative novel tools for valid and efficient statistical appraisals. Motivated by 
these perspectives in such HDLSS models, some applications of the Chen-Stein 
[3] theorem in such K ^ n environments are presented in the last section. These 
generalizations cover both the MHT and the gene-environment interaction testing 
problems. 

2. An illustrative data model 

We consider a genomic model arising in microarray data analysis as an illustration. 
The microarray technology allows simultaneous studies of thousands of genes, K, 
possibly differentially expressed under diverse biological/experimental setups, with 
only a few, n, arrays. We may refer to Lobenhofer et al. [11] where for a set of 1900 
genes, arranged in rows, the gene expressions were recorded at 6 time points, with 
8 observations at each time point. Thus 1900 — K ^ ji — 48. The gene-expression 
levels are measured by their color intensity (or luminosity) as a quantitative (non- 
negative) variable, either on the (0, 1) or 0-100 per cent scale, or (based on the 
log-scale) on the real line 3?. A gene associated (causally or statistically) with a tar- 
get disease is known as a disease gene, DG, while the others as nondisease genes, 
NDG. Gene expression levels under different environments cast light on plausible 
gene- environment interactions (or associations) so that if the arrays are properly 
designed, mapping disease genes may be facilitated with such microarray studies. 
One of the main issues is identifying differentially expressed genes among thousands 
of genes, tested simultaneously, across experimental conditions. Typically, for a tar- 
get disease, there are only a few DG while the NDG comprise the vast majority. 
A NDG is expected to have a low gene expression level while a DG is expected to 
have generally higher expression levels. Thus, a natural stochastic ordering of gene 
expression levels of the DG with varying disease severity is plausible while the NDG 
expression levels are expected to be stochastically unaffected by such disease level 
differentials. 

Microarray data go thorough a lot of standardization and normalization so that 
conventional simple models, such as the classical MANOVA models, may rarely be 
totally adaptable. If the arrays are indexed by an explanatory or design variate 
(t) that possesses an ordering (not necessarily linear), then the stochastic ordering 
could be exploited through suitable nonparametric techniques. The main difficulty 
in modeling and statistically analyzing microarray data stems from the high di- 
mensionality of the genes compared to the number of arrays. While the different 
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arrays may sometimes be taken to be at least statistically independent, the genes 
may not. Moreover, not much is known about the spatial topology of the genes 
or their genetic distances. There is another factor that merits our attention. The 
gene expression levels for the different genes in an array are neither expected to 
be stochastically independent nor (marginally) identically distributed. Sans such 
an i.i.d. clause, standard parametrics typically adaptable for fMRI models (albeit 
mostly done in a Bayesian coating) may encounter roadblocks for fruitful adapta- 
tion in microarray data models. Thus, structurally, such data models are different 
from those usually encountered in nonparamctric functional regression models. For 
this reason, a pseudo-marginal approach is highlighted here. This approach exploits 
the marginal nonparametrics fully and renders some useful modeling and analysis 
convenience. 

3. Some HDLSS formulations 

Motivated by microarray data models introduced in Section 2, we consider here a 
set of n arrays (sample observations) where there is a design variate ti associated 
with the ith array, for i = 1, • . • , n- Without loss of generality, we assume the ti are 
ordered, i.e., 

(3.1) ti < t2 < • • • < in, 

with at least one strict inequality. We do not, however, impose any linear or spe- 
cific parametric ordering of these design variates. The multisample (ordered alter- 
native) model is a particular case where n can be partitioned into / subsets of 
sizes ni,...,n/ such that within each subgroup, the ti arc the same while they 
are ordered over the / different subsets. For the iih array, corresponding to the K 
genes (positions), we have a gene expression level denoted by Xik, k = 1, . . . , K, 
so that we have vectors X.; ~ [Xn, . . . , Xi^)' , for i = I, . . . ,n. The joint distri- 
bution function of is denoted by i^i(x), x € JR^^. Further, for the fcth gene in 
the ith array, i.e., Xik, the marginal distribution is denoted by Fik{x), x £ for 
k = 1, . . . , K; i = 1, . . . ,n. For a given i, the Fik, k ^ 1, . . . , K may not be gener- 
ally the same, and moreover, the Xik, k = I, . . . , K may not be all stochastically 
independent. 

If a gene k is NDG and the ti reflect the variability of the disease level, then the 
Fik, i = I, ■ ■ ■ ,n should be the same. On the other hand, for a DG k, for i < i' , Xik 
should be stochastically smaller than Xiik in the sense that the Fik,i = 1, . . . ,n 
should have the ordering 

(3.2) Fik{x) > F2k{x) >■■■> F„fc(a;),V x e 5R. 

Therefore, we could force a characteriation of DG and NDG based on the following 
stochastic ordering: For a NDG k, the Fik, i = 1, . . . , ti are all the same, this being 
denoted by the null hypothesis i?ofe, while for a DG k, the stochastic ordering in (3.2) 
holds which we denote by Hik, for k = 1, . . . ,K. In this marginal formulation, we 
have a set of K hypotheses corresponding to the K genes, and whatever appropriate 
test statistic (say T„fc) we use for testing ifofe vs. Hik, these statistics may not 
be, generally, stochastically independent. The basic problem is therefore to test 
simultaneously for 



(3.3) 



K K 

- n Hok vs i7i = U H^k, 
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without ignoring possible dependence of the test statistics for the component hy- 
potheses testing Hok vs Hik, for k = 1, . . . ,K. This makes it appealing to follow 
the general guidelines of the Roy [13] union-intersection principle (UIP), albeit in 
a marginalization (i.e., adapting a finite union and finite intersection scheme), and 
thus permitting a more general framework so as to allow simultaneous testing and 
classification into DG / NDG groups. In a very parametric setup, some order re- 
stricted inference problems have been considered by [12]. However, in our setup, 
such normality based parametric models may not be very appropriate. 

Our approach is based on the classical Kendall tau statistics for each of the K 
genes and the incorporation of these (possibly dependent) marginal statistics in a 
composite scheme for classification. For the fcth gene, based on the n observations 
Xik, i = 1, . . . , 71, and the tagging variables ti, . . . , i„, we define the Kendall tau 
statistic as 

(3.4) Tnk = ('^ sign(^»'fe - X,k)signit,, - 1,), 

^ l<i<i'<n 

for k = 1,.. .,K. Conventionally, we take sign(O) = 0. Note that Tnk is a (gen- 
eralised) J7-statistic of degree 2 [7]. Further, note that by (3.1), we may set S = 
{{i, i') : ti < ti' ; 1 < i < i' < n} and let TV be the cardinality of the set S. Then by 
(3.1), ri — 1 < < (2). Moreover, we may rewrite Tnk as 

(3.5) Tnk - Q 5^ sign(X,,fc - X,k), k = l,...,K, 

where S depends on the ordering of the tj and therefore remains the same for every 
k = 1, . . . ,K. Note further that whenever N < (2) , the range of variation of Tnk is 
( —N/ (2), N/ (2) ) which is contained in the interval (—1, 1). That is why we shall 
find it convenient to take the modified or rescaled Kendall tau as 

(3.6) T°, = J2 sign(^»'fc - X,k), 

s 

whose range is exactly (—1, 1), albeit the distribution being still discrete. 

Note that for any k ^ 1, . . . , K , under i/ofc, for every i 7^ i', the difference Xi/k — 
Xik is symmetrically distributed around 0, and hence, Eok{sign{Xiik — Xik)} = 
so that 

(3.7) Eok{Tnk} = Eok{T°k} = 0, V fc = 1, . . . , iC. 

Further, the marginal distribution of Tnk under _ffofc is generated by the n\ equally 
likely permutations of the Xik among themselves. Therefore when all the Fik are 
continuous, ties among the observations being negligible with probability 1, Tnk 
(or T°j.) is distribution-free under i/ofe- This distribution may depend on the set iS 
but that being the same for all k, we conclude that under in (3.3), marginally 
each T°i, is distribution-free and these K statistics all have the same marginal 
distribution. If all the ti were stochastic and continuous then N ~ (2) and we will 
have 

(3.8) Varo{T„fe} ^ 2{2n + 5)/{9n{n - 1)}. 

On the other hand, in general for N < (2), the ti are not distinct and may be even 
nonstochastic, and hence, the variance is equal to 



(3.9) 



Varo{r°J = iV-2{(2/3)(7Vi - N^) + N} = say, 



256 



P. K. Sen 



where A^i is the cardinahty of the set : ti < ti',ti < ti'i,tii ^ t,//} 

and -/V2 is the cardinahty of the set («",«) : ti > ti" ^ ti/}. For smaU vahies 

of n and given (3.1), one can enumerate S and obtain the exact distribution of 
T°j. under ifofc- If n is large, the standardized form of the statistic, i.e., T°f./vn has 
closely a standard normal distribution. In our setup, perhaps the exact permutation 
distribution plays a greater role and this will be illustrated later on. 

The behavior of T°j, under alternatives would naturally depend on the stochastic 
ordering in (3.2) and these statistics will not be exact distribution-free nor possibly 
have identical marginal laws. Nevertheless, under (3.2), for every i < i', Xi'k — Xik 
has a distribution tilted to the right, so that 



This motivates us to use tests based on the marginal statistics T°f. using the right 
hand side critical region, or cquivalently the right-hand sided p- values. Recall that 
the distribution of each r°j,, at least for n not too large, is discrete, but that is not 
going to be of any particular concern. A greater concern is to incorporate possible 
stochastic dependence among the K statistics r°j., k = 1, . . . (even under the 
null hypothesis) and their possible heterogeneity when some of the Hd^ are true. 
A basic problem is to formulate suitable multiple hypothesis testing procedures to 
assess which hypotheses are to be rejected subject to a suitably defined Type I error 
rate. This is elaborated in the next section. 

4. Dimensional asymptotics and the union intersection test 

Although independence across microarrays may be assumed, their i.d. structure 
may be vitiated if the arrays relate to different biological or experimental setups. 
Moreover, for different genes, the gene expression (marginal) distributions are likely 
to be different when there is gene-environment interaction. Taking into account such 
plausible inter-gene stochastic dependence and heterogeneity, we need to prescribe 
statistical modeling and analysis tools. This will be accomplished through dimen- 
sional asymptotics where K is made to increase indefinitely while n, being small 
compared to A', may or may not be adequately large. 

In view of (3.3), it is tempting to appeal to the union-intersection principle [1 
or UIP, to construct suitable test statistics which will cover the genome- wise picture 
in a reasonable way. Towards this, we may note that as under Ho (i.e., H^k^yk), 
marginally each has the same distribution (which docs not depend on the 
underlying Fik ). Thus, corresponding to any c ; — 1 < c < 1, the tail probability 
Pq{T°^, > c} is the same for all k and this can be evaluated by using the exact 
permutation distribution generated by the n\ permutations of the Xik, 1 < i < n. 
The UIP then leads to the following union-intersection test, UIT, statistic: 



where the test function is given by (f>{T*^) = 1,7, or 0, accordingly as T*° is >, = 
or < c and 7 : (0 < 7 < 1) is so chosen that EQ{(t){T*°)} = a, the preassigned level 
of significance. Note that for n not adequately large, the null distribution of 
is essentially discrete and hence this usual randomization test function is aimed to 
take care of this problem. 

The crux of the problem is therefore to determine such a critical level Cq,. The 
joint distribution of the T°^., 1 < fc < K, even under the null hypothesis Hq, depends 



(3.10) 



EiT^k I i/ifc }>0, Vfc=l,...,if. 



(4.1) 



T,r = max{T°, : 1 < fc < A'} 



Kendall's tau in high- dimension 



257 



on the underlying X-dimensional distribution Fi, and hence, in general will not be 
distribution-free. Thus, the usual technique of finding out the critical level of T*° 
from this joint distribution may be intractable. 

One possibility is to incorporate the fact that under i7o, the ii'- vectors Xi,i = 
1, . . . , n, are i.i.d. and hence their joint distribution remains invariant under any per- 
mutation of these vectors among themselves. Thereby we can evaluate such critical 
values by an to appeal to the permutation distribution generated by the n\ equally 
likely permutations of the A'-vectors {Xi} among themselves. This permutation 
law generates the (unconditional null) marginal laws of the T°^,, and provides some 
conditional versions of their joint distributions of various orders. Since this per- 
mutation law is a conditional law (given the collection of all these ii'-vectors) , the 
critical values obtained in this manner are themselves stochastic, thus introducing 
another layer of variation. Nevertheless, it provides a conditionally distribution- free 
test. One discouraging feature of this permutation approach is that the permuta- 
tion invariance does not hold under the alternative hypothesis, and hence critical 
levels computed from the permutation law involving an observed set of {X^} may 
be sensitive to the data conformity to the null situation. 

If we assume that all the are stochastically independent, then we have for 
any c, — 1 < c < 1, under iJp, 

(4.2) Po{rr < c} = [Po{T°i < c}]^, 

so that the distribution-free nature of the Tnk under the null hypothesis provides the 
access to the computation of the test function and the critical level. If n is at least 
moderately large, in view of the asymptotic normality of T°j./vn^ the randomization 
test function may be replaced by a conventional normal theory test function, where 
for the individual tests, a significance level a* is so chosen that 

(4.3) a = l-(l-a*)^. 

Generally, if we let a* — (a/K), then the size of the UIT is < a no matter whether 
the T°i. are stochastically independent or not. There is, therefore, a certain amount 
of conservativeness in this specification. 

In passing, we may remark that by the classical asymptotics on Hoeffding's 
{/-statistics, any pair (T'°j.,r°^), with /c ^ g, is a bivariate JJ-statistic, for a* suf- 
ficiently small, so using the bivariate extreme statistics results (viz., [19]), we can 
claim that the events {T'°^. > Cq.} and {T°^ > c^.} will be asymptotically (as 
K oo) independent so that Po{T°i^ > Ca*,T^g > c^*} can be well approximated 
by [Po{T°f. > Cq.-}]^. In a similar manner, the third order probability terms can 
be handled, and the Bonferroni bound retaining the second and third order proba- 
bilities provide a good approximation : a = Ka* — (^)q;*^ -|- {^)a*^ + o{a*^). As 
a result, a* = {a/K) provides a good approximation to the level of significance. 
Therefore, for the UIT, when K is large, even when the genes are not stochastically 
independent, letting a* = {a/K) we may consider the following multiple hypothesis 
testing scheme: 

For a chosen a* ~ K^^a, obtain the marginal distributional critical level Ca* , 
and reject those Hoki k G {1, . . . , K} for which the corresponding T°^, exceeds 

A randomization test function can be prescribed when n is not adequately large. 
Thus, the UIT provides a bound on the family wise error rate, FWER. If we take 
a* ^ a/K and K is large, we need to make sure that n is so large that Vn^Ca- < 
1; this will imply that if we are to use the permutation null distribution of any 
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T°j., being attracted by the permutational central limit theorem, it has a nonzero 
mass point beyond Ca'/vn- If — 0{n~^), as is typically the case, then Ca* = 
0(?i^^/^^/— 2 logo;*) so that log/\ = 0{n) and this docs not appear to be a serious 
concern in real life applications. For example, if wc have three groups of arrays, 
say within each group there are 5 arrays, the total number of partitioning 15 units 
into 3 subsets of 5 each is equal to (15)!/(5!)'^ and this is so large (756,756) that 
even if K is as large as 30,000, it would not be a problem. However, for large K , 
the UIT, like the classical likelihood ratio test, will have little power, and hence 
alternative test procedures need to be explored. This illustrates the important role 
of the design of the study and the number of arrays required in trying to include a 
very large K. 

Roy's UIT can be adapted by exploring the information contained in the ordered 
p-values. If the are all stochastically independent (and as they are identically 
distributed under the null hypothesis Hq) then one can adapt Simes' [20] theorem 
(which is a restatement of the classical Ballot theorem (viz., [9]) introduced some 
twenty years earlier). If Pi, . . . ,Pk are the p-values for the K marginal tests and 
Pra ^ ■ ■ • ^ Pk:K are the corresponding order statistics, then assuming that under 
Hq the Pk have a uniform (0, 1) distribution (i.e., tacitly assuming that the T°j^/vn 
have a continuous distribution under -ffo), Simes' theorem asserts that for every 
a : < a < 1, 

(4.4) P{PK:k > ka/K, y k^l,...,K \Ho} 
Suppose now we define the anti-ranks Si, ... , Sk by letting 

(4.5) PK:k^Ps,, k = l,...,K, 

where again ties among the ranks are neglected under the assumption of continuity 
of the distribution of the Pk- Whereas Simes' theorem provides a test of the overall 
hypothesis, Hochberg [ti] derived a step-up procedure for multiple hypotheses testing 
based on the following : For every a G (0, 1), 

(4.6) P{PK:k > a/iK -k + l),y k = l,...,K \Ho} = l-a. 

Benjamini and Hochberg [2] considered a step-up procedure based on the Simes 
theorem. Their multiple hypothesis testing procedure is the following: 

Reject those null hypotheses {HoSk} fo'"' which Pg^ < ka/K, k ~ 1, . . . , K, and 
accept those null hypotheses in the complementary set. 

For some related developments in a parametric setup, we refer to [2], [4], [10], [14] 
and [21], among others. 

These developments paved the way for other measures of error rates which are 
more adaptable in the K ^ n environment. Some of these will be discussed later 
on. There are two basic concerns that can be voiced in this respect. The whole setup 
is based on the assumed uniform distribution of the P^ under the null hypothesis. 
However, if we look into the statistics T°j, in our setup, we may note that though 
they have a specified distribution, the latter is a discrete one defined over the in- 
terval (—1, 1). Noting that there are a set of discrete mass points, ties among the 
T°f./vn (and hence Pk) can not be neglected with probability one, and moreover, 
the Pk will have a set of probability mass points on [0, 1] with non-zero masses. 
Thus, technically the above probability results are not strictly usable (unless n is 
indefinitely large, contradicting the K ^ n environment). Secondly, as was stressed 
earlier, the T°f. across the set of genes are generally not stochastically independent. 
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Controlling the FWER when K is very large may generally entail undue conserva- 
tiveness of multiple hypotheses testing schemes. On the other hand, using a level 
of significance for each marginal hypothesis testing problem may lead to a large 
FWER. 

In the context of microarrays suppose that there arc Ki disease genes (DG) and 
Kq = K — Ki NDG; thus, we have a set of Kq null hypotheses which are true and a 
complementary set of Ki hypotheses which arc not true. Suppose that based on our 
multiple hypotheses testing procedure, wc accept mg out of Kq true null hypothesis 
so that the remaining A'o — mp = nii true null hypotheses are rejected. Similarly, 
among the Ki not true null hypotheses, Iq arc accepted as true and li accepted in 
favor of the alternative. Thus, a totality of i? = mi + li hypotheses are rejected 
while K — R are accepted. Mind that though we observe i?, through our chosen 
multiple hypotheses testing procedure, individually mi,Zi are not observable; all 
these (i?, ZijTOi) are stochastic in nature. A natural modification of the FWER, to 
suit such K ^ n environments, is the per- comparison error rate (PCER) defined 
as 

(4.7) PCER = E{mi)/K, 

which is the expected proportion of Type I errors among the K hypotheses. A 
related measure is the per-family error rate (PFER), defined as 

(4.8) PFER = E{mi), 

which is the expected total number of Type I errors among the K hypotheses. 
Obviously, PFER ~ ii'.PCER, and is generally large when K is large (unless the 
PCER is very small). Moreover, 

(4.9) PFER ^ E{mi) = ^ rP{mi ^ r} > P{mi > 0}, 

so that PFER > FWER. 

If our observed R = then no true null hypothesis is rejected and hence there 
is no false discovery. For R > 1, the proportion of false discovery is given by 
Q = mi/R] conventionally, it is taken Q = when i? = 0, so that Q is properly 
defined for every nonnegative R and mi. However, Q is not observable. Hence, the 
false discovery rate (FDR) is defined as 

(4.10) FDR = E{Q} = ^P{R = r}E{mi/R\R^r}. 

r>l 

Since, conventionally, we have forced Q = for i? = 0, this definition of FDR may 
produce a negative bias. An alternative definition, known as the ^FDR, is defined 
as 

(4.11) PFI)R = E{Q\R> 0} = FDR/ P{R>1}. 

Naturally, ^FDR > FDR. 

In the formulation of FDR and ^FDR it is not necessary to assume that all 
of the test statistics have continuous distributions under the null hypothesis. If 
these distributions are all continuous then of course the p-values have a uniform 
(0, 1) distribution under the null hypothesis, and hence, the multiple hypotheses 
testing schemes discussed earlier can be conveniently adapted. In our setup, each 
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Fig 1. Comparison of the null distribution with the alternative distribution. 



test statistic has marginally the same null distribution, albeit that is discrete. So, 
it might be necessary, especially when n is not large, to make use of this otherwise 
completely specified, discrete distribution without assuming a uniform distribution 
for the associated p- values under the null hypothesis. 

We may simulate the permutation distribution of any marginal test statistics and 
thereby take into account possible dependence among the gene expressions without 
assuming any specific pattern. Of course, marginally, each test statistic has the 
same null distribution. So, if we consider the set {'r°j. : fc = 1, . . . , K} and define 
the empirical distribution 

K 

(4.12) GK{t) = J2 ^(^nk <t),te (-1, 1), 

fe=l 

then Eo{GK{t)} = G{t),yt E (—1, 1) where G{t) is the common marginal distribu- 
tion of the under the null hypothesis. The summands in Gxit) are all bounded 
variables, nondecreasing in i G (—1, 1) and G(t) is also nondecreasing and assumes 
values on (0, 1). Thus, whenever Gxit) stochastically converges pointwise to G(t), it 
does so uniformly in t € (—1, 1). Further Gxit) — G{t) is a bounded r.v., and hence, 
if it converges in probability, it converges in the rth mean for every r > 0. Therefore 
it might suffice to assume that the dependence pattern satisfies the condition: 

(4.13) Var(G'x(t)) ^ 0, as K oo. 

Then we conclude that II Gif(.)-G'(.) II = sup{\GKit)-G{t)\ : te (-1, 1)} stochasti- 
cally converges to 0. Further, (4.13) holds under quite general dependence patterns. 

It is naturally tempting to explore weak convergence (invariance principles) re- 
sults for VI{{Gk{-) — G{.)) wherein K is taken indefinitely large but not n. Since 
G{t),t € (—1,1) is a discrete distribution function with mass points over (—1,1), 
the jump-discontinuities of G(.) may vitiate the usual compactness (or tightness) 
properties possessed in the continuous case, albeit by strengthening (4.13) to 



(4.14) 



limsup^ /vVar(GK(t)) < oo,V i £ (-1,1), 
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pointwise, the asymptotic normality (as K —^ oo) follows under quite general de- 
pendency conditions. If we have some linear functional of G/f (.) as a test statistic, 
this weak convergence would have been quite useful in deriving the asymptotic (in 
K) normality of the test statistic under the null hypothesis; (4.14) would have been 
sufficient in that context. However, in our case, we have some functional of G/f (.), 
of extremal order statistic type, namely, the extreme quantiles of a set of depen- 
dent r.v.s, and hence we may need somewhat different regularity conditions. This 
perspective is appraised more elaborately in the next section. 

5. Dimensional asymptotics and Chen Stein theorem 

In the previous section we have briefly discussed the plausibility of some Kg NDG 
and Ki DG with Kg + Ki = K, the total number of genes. Neither Ki nor the 
DG positions are known and hence we have a dual problem of estimating Ki as 
well as identifying the positions of these A'l DG's. It is conceivable that the NDG 
having stochastically smaller expression levels (than the DG) and the stochastic 
dependence among the DG may not be insignificant. We intend to incorporate this 
stochastic dependence structure among the gene expressions in a suitable model. 
Unfortunately, sans any positional ordering of the K genes, it might be difficult to 
assume suitable mixing conditions under which central limit theorems may apply. 
As for considering alternative limit theorems for dependent sequences, we intend 
to incorporate the Chen-Stein theorem [.!] and its ramifications wherein Poisson 
approximations for more general dependent sequences are advocated. For our con- 
venience, let us state the Chen-Stein Theorem in a slightly updated version [1]. 

Theorem 1. (Chen-Stein): Let 2 be an index set with elements i 2 and let K be 
the cardinality of the set X. For each i ^ T let Yi be an indicator random variable 
and let 



Let W = X^igi^i total number of occurrence of the events {Yi = 1}, i S 
X, and let Xk = Tlin^iPKi = E{W). For each i X, we define a set Ji ^ X 
and its complement Jf; as the set of dependence of i and its complement, set of 
independence ofi. Thus, it is tacitly assumed thatYi is independent of{Yj, j e J^i}, 
for every i G X. Further, let 



(5.1) 



1} = 1 - P{Y, ^ 0} = PK^, I ex. 



^ ^ E{Y,)E{Y,); 



iei jeJi 



(5.2) 




(5.3) 





ieij{^i)ej. 



and 



(5.4) 



63 = 5]£;|{i?(F. - e{y,)\{y,Xj e jm- 

i€l 
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Finally, let Z be a random variable having Poisson distribution with parameter 
E{Z) = Xk. Then 

\\CiW)-C{Z)\\ < 2(&i+&2 + &3)i^T^ 

(5.5) < 2{bi+b2 + b3)miii{l,Xj}}. 
A direct corollary to Theorem 1 is the following: 

(5.6) \P{W ^Q}-e-^''\ < 2(&i + &2 +63)min{l,A^i}. 

An interesting feature of this Theorem is the dual control of Xk , the expectation and 
61,62, and 63, the dependence functions. In line with our intended application we 
consider a natural extension of this result. With the same notation as in Theorem 
1, we replace the Yi,i € 1, by a sequence of processes Yi{t),i £ l,t e T, where 
T = (0, a), for some a > 0, and assume that for each i, Yi{t) is nondecreasing in t 
and yet a zero-one valued random variable. Further assume that the sets jTi do not 
depend on i S T. For every i £l, < G T, we denote by pKi{t) = E{Yi{t)), and the 
corresponding parameters by Aif(i), 61 (t), 62(0 and b3{t). Let Wk = {WK{t),t € 
T} be the sum process and corresponding to Z, we introduce a Poisson process 
Zk = {ZK{t),t G T} whose expectation process is {A^ = {Ax(t),i S T}. Then 

\\C{Wk) " CiZK)\\ < 2 sup{(6i(t) + b2{t) + 63(0) m'*^ --teT}. 

The proof of this extension is along the lines of Theorem 1 and hence we omit the 
details. 

In our study, unless n is large, we may not have a continuous time parameter 
(t G T). Thus, we consider an intermediate result that remains applicable for small 
n as well. 

Theorem 2. Consider a set of M discrete time points —1 < ti < • • • < tj\/ < 1 
with respective probability masses rjni, ■ ■ ■ jrjnAi where M may depend on n. Also, 
let = J2i<j Vni, j = 1, . . . , M. Further, let Yi{Tj),i = 1, ... A', j = 1, . . . , M 
be an array of zero- one valued random variables where Yi{Tj) is nondecreasing in 
Tj and E(Y,{tj)) = Vnj. j = 1,...,A/. Define Wk = {I^k(tj),j = l,...,Af} 
where WxiTj) = Y^^=iYi{Tj) for j = 1,...,M. Similarly, let Zk = {ZK{Tj),j = 
1, . . . , A/} be a discrete time parameter Poisson process with the drift function vk = 
{vn],i = 1, . . . , M}. Define the parameters bKi{Tj),bK2{Tj), bxziTj), j = l,...,M 
as in (5.2), (5.3), and (5.4); assume that as K — > 00, 

(5.7) max{(6/^i(Tj) + 6x2 (tj) + 6^3 

Then, as K increases indefinitely, 

(5.8) \\C{Wk)- C{Zk)\\^Q. 

Again, being a finite-dimensional version of Theorem 1, this does not need an 
elaborate proof. 

In the present context, under the null hypothesis, all the T°^ have a common 
distribution over (—1, 1); this is discrete but symmetric about 0, and is completely 
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known (though could be computationally intensive if n is not too small). Let us 
denote the distinct mass points for T°j, by — 1 = ai < 02 < • • • , = 1 and let 

(5.9) T, = Po{T°fc > aL-,+i}, J = 1, . . . , L. 
Then < ti < T2 < • • • < < 1. Also, let us write 

(5.10) Yk{T,) = /(r°fe > aL-j+i). J = 1, . . . , L, fc = 1, . . . , if. 
Further, let 

K 

(5.11) Wk{tj)=Y,Yu{t,), j = 1,...,L. 

fc=i 

Also, let J ~ max{j : 1 < j < i; < 77} for some pre-assigned 77 > 0. Basically, we 
would like to pursue the distributional features of the partial sequence {VFif (tj ), j < 
J}, and incorporate Theorem 2. Note that in this way, we avoid the conventional 
assumption of a continuous null distribution of the coordinate-wise test statistics. 
Of course, if n is adequately large, the assumption of a uniform distribution of the 
p- values (under the null hypothesis) would be reasonable. For example, if we have 
a three sample situation with rii = n2 = = A then L = (12)!/(4!)'^ = 34,650 so 
that we could choose J = 1 and use the Poisson approximation. It is also possible 
to choose J ~ 2 with an appropriate cut-off point and still stick to a FWER around 
0.05. In any case, under alternatives (of stochastic ordering) the distribution of the 
r°j, will be tilted towards the right, still confined to the interval (—1, 1), and hence, 
their centering would be shifted to the right of the origin with a negatively skewed 
distribution. 

Corresponding to the known points ti <•••< tj, let us consider the partial pro- 
cess WK{Tj),j = 1, . . . , J, as defined above. Also, let us choose a set of nonnegative 
integers ri < • • • < r j in such a way that 

(5.12) Po{Wk{tj) > rj, for some j < J} = a, 

where a may not be exactly equal to a specified level (such as 0.05) but can be 
approximated very well through the above Poisson process result. If we let 

(5.13) A, = [WK{Tj)>rjl j = l,...,J, 

then (5.12) can be written as P{ Uj<7^i }i ^o that by the Bonferroni inequality, 

j<J 3<J i<J<j'<J 

(5.14) + P{A,AkAi} + ■■■ + {-!)'' P{ A, ■■■Ak}. 

l<j<k<l<J 

Next, we use the Poisson approximation to each P{Aj} wherein we use the follow- 
ing: 

(5.15) P{A,}~e-''-{^<^./H}, j>l. 

r>rj 

Further, note that WxiTj) is a nondccreasing (step) function in j so that using the 
Markov property and Theorem 2 we may evaluate P{AjAji}. Actually, we write 
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P{AjAk} = P{Aj} ■ P{Ak\Aj}, for k > j, and use Theorem 2 to approximate 
the conditional probabihty by P{Zk > rk\Zj > rj} where rk > rj,\/j < k. Also, 
typically terms involving more than 2 events (Aj) will be small and can usually be 
neglected. Nevertheless, even if they are not small, the Markov property embedded 
in Theorem 2 can be used to provide a good approximation. Alternatively, we may 

write P{[jj<j ^j} = 1 - ^iC\3<J ^j} ^^i^g Theorem 2, write PlCljKj ^j} as 
a J-tuple sum over Poisson distributional probabilities. For small rj^j < J, as is 
typically the case, this computation does not appear to be a formidable task. 
Led by these findings, let us now consider the following testing procedure: 
Compute the WV(Tj), j < J as above. IfWK{Tj) < rrij, Vj < J, accept the null 
hypothesis that there is no DG. On the other hand, if WK{Tj) is greater than nij 
for at least one j < J , then reject the null hypothesis that all the genes are NDG, 
and proceed to detect those genes k Cz JC as DG where 

(5.16) /C = {fc e {1, . . . , A"} : Yfc(Tj) = 1, for some j < J}. 

Note that if for some k, Yk{Tj) = 1 for some j < J, then Yk{Tji) = 1, V j' > j. 
Further, note that /C is a stochastic subset of {1, ... , K}, and R = cardinality of K. 
is a (nonncgativc) integer valued random variable. The overall significance level of 
this testing procedure is well approximated by the preassigned level a. 
Let us denote the following exclusive events by 

(5.17) Bi = Ai ; Bj^Al--- A'^_^Aj , j < J. 

Then, by definition, Aj = C\j<j Pj- With the same notation as in (4.7) — (4.11), we 
study the other measures (viz., PCER, PFER, FDR and ^'FDR). Towards this, we 
consider the nonnuU situation where Kq are NDG and Ki ~ K — Kq arc DG. To 
handle the distribution of R, the total number of rejections, wc let 

(5.18) r* = (KoT, + K,p,)/K = r, + {R\/K){/3, - r,), j > 1, 
where 

(5.19) f3,=K-' P{T:k><^L-j+i\ke {l,...,i^}-/Co}, 

keDG 

for j = 1, . . . , J. Note by arguments similar to those in Sections 3 and 4, f3j ^ 
Tj , Vj < J. We may write 

(5.20) E{mi) = E{miI{Bj)). 

Next note that the events Bj, j < J, depend on the partial process WK{Tj),j < J 
and are thereby governed by Theorem 2 with Unj = Kt*,j < J. On the other hand, 
the distribution of mi is governed by the process W^{Tj),j < J, where the drift 
function for W^{Tj)) is 1/°^ = KqTjjJ < J. Using Theorem 2 and the reproductive 
property of the Poisson distribution, we may well approximate the (conditional) 
distribution of mi, given R, by a binomial law with parameters {R, KoTj / (KoTj + 
Kij3j)) whenever Bj holds. Thus, we are able to provide a good approximation to 
the PFER by writing 



(5.21) 



£;(mi) = Ye{e{^iIR I R^Bj)RI{Bj)}. 

3<.J 
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If J = 1 , the conditional binomial law directly applies and we have the approxima- 
tion 

where the last step follows from the fact that for a Poisson variable X with parame- 
ter np, E{XI{X > r)) = npP{X > r}. For J > 2, we have to apply the conditional 
binomial law under the sets Bj, followed by the distribution of R over the sets Bj, 
and this can be done by repeated quadrature procedures. Numerical studies have 
thereby good scope. 

By construction, rejection of the null hypothesis Hq entails that R > ri and may 
even be greater than ri if Bj pertains for some j >1. As such, we do not have any 
problem in applying the original definition of FDR (in (4.10)). We write 

(5.23) FDR = E{Q) = ^ E{QI{Bj)) = ^ E{E{Q\R e Bj)I{B^)} 

3<J 3<J 

and use the conditional binomial law for each term in the right hand side. Detailed 
numerical study is planned for a future communication. 

We conclude this section with some pertinent remarks and observations. First, 
the use of the Chen-Stein theorem in a multi-state context can be done under fairly 
mild regularity conditions regarding the dependence of the genes. Secondly, by our 
choice of the rj,j < J and allowing possibly J > 1, we are not only in a position 
to allow more flexibility in the choice of statistical inference procedures but also to 
enforce the rejection of null hypothesis under a more structured setup. This allows 
us to study the FDR, etc., under more diverse setups. Further, using Kendall's tau 
statistic for each gene separately, we are in a position to allow heterogeneity of the 
gene expressions across the K genes in a completely arbitrary manner, while un- 
der the null hypothesis, the distribution of the T°^,, fc = 1, . . . ,K being completely 
known provides an easy access to the incorporation of the Chen-Stein theorem. 
Finally, instead of using Kendall's tau statistic (coordinate- wise) , it might be at- 
tractive to use more general rank statistics [ I 7]. Though the distribution-free aspect 
holds under the null hypothesis, such distributions are more complex to evaluate 
and the associated Poisson processes have more complex drift functions. Further, 
such linear rank statistics involve some design variables which assume more struc- 
ture on the Fih, k = 1, . . . , K , not necessary with the use of Kendall's tau. 
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